ABSTRACT

Summary: Analysing large amounts of data generated by next- 
generation sequencing (NGS) technologies is difficult for researchers 
or clinicians without computational skills. They are often compelled 
to delegate this task to computer biologists working with command 
line utilities. The availability of easy-to-use tools will become essential 
with the generalization of NGS in research and diagnosis. It will 
enable investigators to handle much more of the analysis. Here, 
we describe Knime4Bio, a set of custom nodes for the KNIME (The 
Konstanz Information Miner) interactive graphical workbench, for the 
interpretation of large biological datasets. We demonstrate that this 
tool can be utilized to quickly retrieve previously published scientific 
findings. 
the scientists themselves for building complex analyses. They allow
data reproducibility and workflow sharing.

Galaxy (Blankenberg et al., 2011), Cyrille2 (Fiers et al., 2008) and
Mobyle (Néron et al., 2009) are three web-based workflow engines
that users have to install locally if computational needs on datasets
are very large, or if absolute security is required. Alternatively,
software such as the KNIME (Berthold et al., 2007) workbench
or Taverna (Hull et al., 2006) runs on the user's desktop and can
interact with local resources. Taverna focuses on web services and 
may require a large number of nodes even for a simple task. In 
contrast, KNIME provides the ability to modify the nodes without 
having to re-run the whole analysis. We chose the latter tool
to develop Knime4Bio, a set of new nodes mostly dedicated to the 
filtering and manipulation of VCF files. Although many standard 
nodes provided by KNIME can be used to perform such analysis, 
our nodes add new functionalities, some of which are described 
below. 



1 INTRODUCTION 

Next-generation sequencing (NGS) technologies have led
to an explosion in the amount of data to be analysed. As
an example, a VCF (Danecek et al., 2011) file (Variant
Call Format, a standard specification for storing genomic
variations in a text file) produced by the 1000 Genomes Project
contains about 25 million single nucleotide variants (SNVs)
[http://tinyurl.com/ALL2of4intersection (retrieved September
2011)], making it difficult to extract relevant information using
spreadsheet programs. While computer biologists are used to
invoking common command line tools (such as Perl and R) when
analysing those data through Unix pipelines, scientific investigators
generally lack the technical skills necessary to handle these tools 
and need to delegate data manipulation to a third party. 
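
To make the file format concrete, the following is a minimal sketch of how a single VCF data line could be split into its eight fixed columns and its INFO key/value pairs. The class name `VcfRecord` and its methods are hypothetical and are not part of Knime4Bio; sample (genotype) columns are ignored for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for the eight fixed VCF columns (sample columns are ignored). */
final class VcfRecord {
    final String chrom;
    final int pos;        // 1-based position, as written in the file
    final String id;
    final String ref;
    final String alt;
    final String qual;    // kept as text because "." means "missing"
    final String filter;
    final Map<String, String> info;

    private VcfRecord(String[] f) {
        chrom = f[0];
        pos = Integer.parseInt(f[1]);
        id = f[2];
        ref = f[3];
        alt = f[4];
        qual = f[5];
        filter = f[6];
        info = parseInfo(f[7]);
    }

    /** Parses one tab-delimited data line; returns null for empty or '#' header lines. */
    static VcfRecord parse(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] fields = line.split("\t");
        if (fields.length < 8) {
            throw new IllegalArgumentException("Expected at least 8 columns: " + line);
        }
        return new VcfRecord(fields);
    }

    /** Splits "DP=14;AF=0.5;DB" into key/value pairs; flags such as DB map to "". */
    static Map<String, String> parseInfo(String text) {
        Map<String, String> map = new LinkedHashMap<>();
        if (text.equals(".")) {
            return map;
        }
        for (String entry : text.split(";")) {
            int eq = entry.indexOf('=');
            if (eq < 0) {
                map.put(entry, "");
            } else {
                map.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return map;
    }

    /** True when REF and ALT are each a single base, i.e. a simple SNV. */
    boolean isSnv() {
        return ref.length() == 1 && alt.length() == 1 && !alt.equals(".");
    }
}
```

Reading a file line by line through such a parser keeps memory use constant, which is what makes files with millions of records tractable where a spreadsheet is not.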

Scientific workflow and data integration platforms aim to make 
those tasks more accessible to those research scientists. These tools 
are modular environments enabling an easy visual assembly and an 
interactive execution of an analysis pipeline (typically a directed 
graph) where a node defines a task to be executed on input data 
and an edge between two nodes represents a data flow. These 
applications provide an intuitive framework that can be used by 
2 IMPLEMENTATION

The Java API for KNIME was used to write the new nodes,
which were deployed and documented using dedicated XML
descriptors. A typical workflow for analysing exome sequencing data
starts by loading VCF files into the working environment. The data
contained in the INFO or the SAMPLE columns are extracted and 
the next task consists in annotating SNVs and/or indels. One node 
predicts the consequence of variations at the transcript/protein level. 
For each variant, genomic sequences of overlapping transcripts
are retrieved from the UCSC knownGene database (Hsu et al.,
2006) to identify variants leading to premature stop codons,
non-synonymous variants and variants likely to affect splicing. Some
nodes have been designed to find the intersection between the
variants in the VCF file and various sources of annotated genomic
regions, which can be: a local BED file, a remote URL, a MySQL
table, a file indexed with tabix (Li, 2011), a BigBed or a BigWig
file (Kent et al., 2010). Other nodes are able to incorporate data from
other databases: dbNSFP (Liu et al., 2011), dbSNP, Entrez Gene,
PubMed, the EMBL STRING database (von Mering et al., 2007), Uniprot,
Reactome, GeneOntology and MediaWiki, or to export
the data to SIFT (Ng and Henikoff, 2001), Polyphen2 (Adzhubei
et al., 2010), BED or MediaWiki formats. After being annotated,
some SNVs (e.g. intronic) can be excluded from the dataset and the 
remaining data are rearranged by grouping the variants per sample 
or per gene as a pivot table. Some visualization tools have also
been implemented: the Picard API (Li et al., 2009) or the IGV
browser (Robinson et al., 2011) can be used to visualize the short reads
overlapping a variation.

Fig. 1. Screenshot of a Knime4Bio workflow for the NOTCH2 analysis.
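
The intersection between variants and annotated regions mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the code of the Knime4Bio nodes: the classes `BedRegion` and `BedIntersector` are hypothetical, only a local BED file is considered, and a simple linear scan stands in for the indexed lookups that tabix or BigBed would provide. The one detail worth noting is the coordinate conversion: BED intervals are 0-based and half-open, whereas VCF positions are 1-based.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical BED interval: 0-based start (inclusive), end (exclusive). */
final class BedRegion {
    final String chrom;
    final int start;
    final int end;
    final String name;

    BedRegion(String chrom, int start, int end, String name) {
        this.chrom = chrom;
        this.start = start;
        this.end = end;
        this.name = name;
    }

    /** Parses the first three (optionally four) tab-delimited BED columns. */
    static BedRegion parse(String line) {
        String[] f = line.split("\t");
        String name = f.length > 3 ? f[3] : "";
        return new BedRegion(f[0], Integer.parseInt(f[1]), Integer.parseInt(f[2]), name);
    }

    /** Tests overlap with a variant given by its 1-based VCF position and REF length. */
    boolean overlapsVariant(String variantChrom, int vcfPos, int refLength) {
        int variantStart = vcfPos - 1;              // convert to 0-based
        int variantEnd = variantStart + refLength;  // exclusive end
        return chrom.equals(variantChrom) && start < variantEnd && variantStart < end;
    }
}

/** Hypothetical helper that scans a list of regions for overlaps. */
final class BedIntersector {
    /** Linear scan returning every region that overlaps the variant. */
    static List<BedRegion> findOverlaps(List<BedRegion> regions,
                                        String chrom, int vcfPos, int refLength) {
        List<BedRegion> hits = new ArrayList<>();
        for (BedRegion region : regions) {
            if (region.overlapsVariant(chrom, vcfPos, refLength)) {
                hits.add(region);
            }
        }
        return hits;
    }
}
```

For a whole-exome dataset a sorted or indexed structure would be preferable to the linear scan, but the overlap test itself stays the same.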

As a proof of concept, we used our nodes to analyse the exomes
of six patients from a previously published study (Isidor et al., 2011)
related to the Hajdu-Cheney syndrome (Fig. 1). For this purpose,
short reads were mapped to the human genome reference sequence
using BWA (Li and Durbin, 2010) and variants were called using
SAMtools mpileup (Li et al., 2009). Homozygous variants, known
SNPs (from dbSNP) and poor-quality variants were discarded,
and only non-synonymous variants and variants introducing premature stop
codons were considered. On a RedHat server (64 bits, 4 processors,
2 GB of RAM), our KNIME pipeline generated a list of six genes in
45 min: CELSR1, COL4A2, MAGEF1, MYO[...] and, more
importantly, NOTCH2, the expected candidate gene.
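
The filtering criteria listed above can be expressed as a chain of predicates. The sketch below is hypothetical and is not taken from the Knime4Bio nodes: the class names, the `Consequence` categories and in particular the quality threshold of 30 are assumptions chosen for illustration, since the text does not state which cut-off defined a poor-quality variant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical consequence categories assigned by an upstream annotation step. */
enum Consequence { SYNONYMOUS, NON_SYNONYMOUS, STOP_GAINED, INTRONIC }

/** Hypothetical annotated variant; the fields mirror the criteria in the text. */
final class AnnotatedVariant {
    final String gene;
    final String genotype;   // VCF-style genotype, e.g. "0/1"
    final boolean inDbSnp;
    final double quality;
    final Consequence consequence;

    AnnotatedVariant(String gene, String genotype, boolean inDbSnp,
                     double quality, Consequence consequence) {
        this.gene = gene;
        this.genotype = genotype;
        this.inDbSnp = inDbSnp;
        this.quality = quality;
        this.consequence = consequence;
    }
}

final class CandidateFilter {
    /** Assumed threshold; the source does not give the actual value. */
    static final double MIN_QUALITY = 30.0;

    /** True for genotypes with two different, non-missing alleles ("0/1", "1|2"). */
    static boolean isHeterozygous(String genotype) {
        String[] alleles = genotype.split("[/|]");
        return alleles.length == 2
                && !alleles[0].equals(".")
                && !alleles[1].equals(".")
                && !alleles[0].equals(alleles[1]);
    }

    /** Combines the four criteria: heterozygous, novel, good quality, protein-altering. */
    static Predicate<AnnotatedVariant> candidate() {
        Predicate<AnnotatedVariant> heterozygous = v -> isHeterozygous(v.genotype);
        Predicate<AnnotatedVariant> novel = v -> !v.inDbSnp;
        Predicate<AnnotatedVariant> goodQuality = v -> v.quality >= MIN_QUALITY;
        Predicate<AnnotatedVariant> proteinAltering =
                v -> v.consequence == Consequence.NON_SYNONYMOUS
                  || v.consequence == Consequence.STOP_GAINED;
        return heterozygous.and(novel).and(goodQuality).and(proteinAltering);
    }

    /** Returns the variants that pass every criterion, in input order. */
    static List<AnnotatedVariant> apply(List<AnnotatedVariant> variants) {
        Predicate<AnnotatedVariant> keep = candidate();
        List<AnnotatedVariant> kept = new ArrayList<>();
        for (AnnotatedVariant v : variants) {
            if (keep.test(v)) {
                kept.add(v);
            }
        }
        return kept;
    }
}
```

Keeping each criterion as a separate predicate mirrors the workflow idea: a single filter can be changed or removed without touching the others.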

3 DISCUSSION 

In practical terms, a computer biologist stayed close to our users to help
them with the construction of a workflow. After this short tutorial,
they were able to quickly experiment with the interface, add some nodes
and modify the parameters without any further assistance, except for the
suggestion or the configuration of some specific nodes (for example,
those that require a snippet of Java code). At the time of writing,
Knime4Bio contains 55 new nodes. We believe Knime4Bio is an 
efficient interactive tool for NGS analysis. 